\chapter{Description of the provided documents}

This chapter gives an overview of the documents that were received from the company CEZ, a.s. It describes the scope of the documents, their size, structure and format. The closing sections explain which documents were selected as the basis for the research in this thesis and elaborate on the kinds of search queries a user might be interested in.

\section{Documents}

The documents are separated into 13 different sets. Each of them contains various types of company documentation: presentations, reports, records, employee profiles and project documentation. The following subsections describe each set.

\subsection{Field of experience summary}

This set contains documents that summarize knowledge management and consists of 32 files. Most of the documents are PowerPoint presentations with many tables, enumerations, short sentences and abbreviations. They sometimes contain embedded Word and Excel documents. The other documents in this set were probably exported from a database and contain descriptions of experience with various events; each of them covers more than one event. Most presentations have more than 20 slides. The length of the embedded Word and Excel documents varies from 1 to 10 pages.

\subsection{Experience records}
\label{sec:Experience}

This is one of the largest sets, with about 250 files. Most documents are individual records of experience, observations and research. The first page summarizes the document in a table that lists the authors and supervisors, followed by a short description of the event, keywords, the date of origin and many other details. The text is structured into sections with long continuous text, tables and enumerations. These documents were created in Microsoft Word and mostly have more than 5 pages. Some of the documents are in Portable Document Format \emph{(PDF)} and were scanned from printed versions of the Word files. The other documents in this set are various Excel forms and tables with names, numbers, abbreviations and phrases.

\subsection{Expert's profiles}

All documents are PowerPoint presentations consisting of one or two slides. Each slide holds one large table containing an employee's profile; the data were probably imported from a database. The profile stores information about the employee such as name, ID, position, experience, role, education, references and many other details. There are 13 documents in this set.

\subsection{Substitution program}

This set of documents is about internship programs within the company. The documents report the transfer of expertise and experience between employees. They are divided into groups of Word documents, Excel spreadsheets and PDF files. The Word and PDF files mostly contain a structured introduction describing the goals of the transfer program, followed by a detailed report spanning multiple pages. The Excel spreadsheets are forms with predefined slots filled with a description of a program event. There are 59 documents in this set.

\subsection{Knowledge management concept}

This set has only one file, a PowerPoint presentation exported to PDF. The document has 124 slides with many tables, indented short texts and images. Its purpose is to present a project.

\subsection{Forms}

The documents are a combination of forms and templates used for writing reports and recording events of various projects. The set is a mixture of Word documents and Excel spreadsheets with short coherent texts, structured into tables with pre-filled example values. The set has 17 files.

\subsection{Yearly reports}

This set contains presentations, transcripts and documentation of annual project reports. The presentations are filled with tables, images and itemized lists of short sentences. The transcripts contain structured lists of goals with short descriptions. The documentation is mostly divided into sections with structured, longer continuous text spanning several pages. There are 10 files in this set.

\subsection{General presentations}

This set contains PowerPoint presentations on various projects, events and training courses. The structure and length of the presentations vary from document to document. They consist of tables, itemized lists and images, with no longer continuous text.

\subsection{Reports}

This set contains 3 Excel spreadsheet reports. They consist of tables with dates, names, values and phrases; there is no continuous text.

\subsection{Experience records from KM}

The experience records are of the same kind as the documents in Section~\ref{sec:Experience} but were obtained from a different source. All documents are PDF files exported from Word documents. The set contains 9 large documents.

\subsection{Management documentation}

This set contains documentation of various projects written for the needs of management. These are textual documents with a summary on the first page followed by sections with longer continuous text. The sections also contain tables and itemized lists with texts of various lengths.

\subsection{General enterprise documentation}

This set of documents is a mixture of presentations, educational texts and tests used to prepare employees, suppliers and emergency staff for work in restricted areas of the company. The set contains about 300 different files. The documents differ in content, size and structure.

\subsection{Supplier's training}

This set contains presentations about training programs that qualify suppliers to work at the CEZ company. The use of tables, itemized lists and images is typical for this type of document. There are about 50 files in this set.

\section{Analysis from the NLP perspective}

Natural language processing (NLP) will be described in the next chapter. As mentioned in the introduction, the NLP tool receives text as input and produces a linguistic dependency analysis for each sentence. Treex is trained on newspaper articles. The language and structure of these articles differ significantly from common business documentation, which will probably result in an incorrect dependency analysis of some sentences. In other cases the analysis may be correct, but poorly formed sentences that make sense to a human might still not be processable by a machine. The common file types are Word, Excel, PowerPoint and PDF. Some of the PDF files were printed from other documents, but others were scanned, which will make text extraction difficult.

The purpose of this thesis is to test the feasibility of information extraction based on natural language processing. This requires a representative set of the documents. Each type of document therefore has to be included: not only the different file types such as .doc, .pdf and .ppt, but also the different structures, that is, documents with long continuous text, itemized short texts, tables, and texts with abbreviations and phrases. The set will be created from the following documents:

\begin{itemize}
	\item \textbf{Experience records} - semi-structured large texts with tables, abbreviations, values. Word and PDF types.
	\item \textbf{Expert's profiles} - structured PowerPoint files with short itemized texts, abbreviations.
	\item \textbf{Substitution program} - reports, different file types, different domains, larger texts.
	\item \textbf{Management documentation} - project documentations, larger texts, different domains.	
\end{itemize}

These different types of text structure will provide a solid basis for testing the ability of the NLP tool to deal with such texts.
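
The file-type part of this selection can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the procedure actually used in the thesis: the extension mapping and the function names \texttt{group\_by\_type} and \texttt{pick\_representatives} are hypothetical, and the sketch covers only the file type, not the structure of the text inside each document.

\begin{verbatim}
from collections import defaultdict
from pathlib import PurePath

# Hypothetical mapping from file extension to the document
# types named in the text (Word, Excel, PowerPoint, PDF).
EXTENSION_TYPES = {
    ".doc": "Word", ".docx": "Word",
    ".xls": "Excel", ".xlsx": "Excel",
    ".ppt": "PowerPoint", ".pptx": "PowerPoint",
    ".pdf": "PDF",
}

def group_by_type(paths):
    """Group file paths by document type.

    Files with an unknown extension are grouped under 'other'.
    """
    groups = defaultdict(list)
    for path in paths:
        suffix = PurePath(path).suffix.lower()
        doc_type = EXTENSION_TYPES.get(suffix, "other")
        groups[doc_type].append(path)
    return dict(groups)

def pick_representatives(paths, per_type=2):
    """Keep at most per_type files of each type, in sorted order."""
    return {
        doc_type: sorted(files)[:per_type]
        for doc_type, files in group_by_type(paths).items()
    }
\end{verbatim}

Such a grouping would only guarantee that every file type is represented; whether a document contains continuous text, tables or itemized lists would still have to be checked by hand.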

\section{Analysis from the IE perspective}
From the information extraction (IE) perspective, the documents contain information about various projects, experience, technologies and events. To test the practical usability of automated IE on these documents, expectations have to be set. Currently, IE is done manually: a human reads a document, recognizes entities and relations, and writes them down. A human is able to understand the context of a text and coreference between sentences, and can still acquire knowledge from grammatically incorrect sentences. A machine, on the other hand, can work only within the boundaries of a sentence, without understanding the context. Grammatically incorrect sentences will probably cause a wrong dependency analysis and a loss of extracted information. Looking at the documents, a human would focus on extracting the following information:

\begin{itemize}
	\item \textbf{Experience records} - Records of various experiences such as new technologies in the workplace, experience with machine repairs, suggested improvements and many others. A human would focus on who recorded the experience, what the experience is about, what is affected and what the result of the record is.
	\item \textbf{Expert's profiles} - Contain information about employees: their education, experience, the section in which they work and their responsibilities.
	\item \textbf{Substitution program} - Reports on the transfer of experience between employees. The important facts are who transfers the experience to whom, what kind of experience it is, what is affected and what the result is.
	\item \textbf{Management documentation} - Contains documentation of various projects: what kind of project it is, who is responsible, details about the project, what it affects and what the result is.
\end{itemize}
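
The questions listed above (who, what, what is affected, what is the result) can be written down as a simple record, which also makes it possible to compare a manual extraction with an automated one. The following Python sketch is an illustration only: the class \texttt{ExtractionRecord}, its field names and the function \texttt{coverage} are hypothetical and not part of any tool used in this thesis, and exact string matching is a deliberately naive comparison.

\begin{verbatim}
from dataclasses import dataclass, fields

@dataclass
class ExtractionRecord:
    """Hypothetical record of the facts a reader would note down."""
    who: str = ""
    what: str = ""
    affected: str = ""
    result: str = ""

def _normalize(value):
    """Ignore case and surrounding whitespace when comparing."""
    return value.strip().lower()

def coverage(manual, automated):
    """Fraction of manually filled fields reproduced automatically.

    Returns 1.0 when the manual record has no filled fields.
    """
    filled = [f.name for f in fields(manual)
              if getattr(manual, f.name)]
    if not filled:
        return 1.0
    matched = sum(
        _normalize(getattr(manual, name))
        == _normalize(getattr(automated, name))
        for name in filled
    )
    return matched / len(filled)
\end{verbatim}

A real comparison would need a more tolerant notion of a match than string equality, for example to accept different word forms of the same entity.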

To claim that automated information extraction is as good as manual extraction, the knowledge described above has to be present in the results. If this can be demonstrated with current technologies, then companies can abandon full-text search engines as obsolete.